ML Course Part 2 - Introduction to Deep Learning

Alexandre Bry

Introduction

Definitions (Reminders of Previous Session)

Categories of ML

Type of dataset

  • Supervised: for each input in the dataset, the expected output is also part of the dataset
  • Unsupervised: for each input in the dataset, the expected output is not part of the dataset
  • Semi-supervised: only a portion of the inputs of the dataset have their expected output in the dataset
  • Reinforcement: there is no predefined dataset, but an environment giving feedback to the model when it takes actions

Type of output

  • Classification: assigning one (or multiple) label(s) chosen from a given list of classes to each element of the input
  • Regression: assigning one (or multiple) value(s) chosen from a continuous set of values
  • Clustering: creating categories by grouping together similar inputs

Dataset

A collection of data used to train, validate and test ML models.

Content

Instance (or sample)

An instance is one individual entry of the dataset.

Feature (or attribute or variable)

A feature is a type of information stored in the dataset about each instance.

Label (or target or output or class)

A label is a piece of information that the model must learn to predict.

Subsets

Dataset subsets

An ML dataset is usually subdivided into three disjoint subsets, each with a distinct role in the training process:

  • Training set: used during training to train the model,
  • Validation set: used during training to assess the generalization capability of the model, tune hyperparameters and prevent overfitting,
  • Test set: used after training to evaluate the performance of the model on new data it has not encountered before.
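
As a minimal sketch of such a split, the function below shuffles the samples and cuts them into three disjoint parts. The function name `split_dataset`, the 70/15/15 proportions and the fixed seed are assumptions chosen for illustration, not values prescribed by the course.

```python
import random

def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the samples and cut them into disjoint train/validation/test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before cutting avoids putting, for example, all the instances of one class in the same subset when the dataset is stored in a sorted order.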

Structure of NN

Definitions

Neural Network

Neural Network (NN)

Subtype of ML model inspired by the brain. Composed of several interconnected layers of nodes capable of processing and passing information.

Common representation of a neural network[1]

Deep Learning

Deep Learning (DL)

Subcategory of Machine Learning. Consists of using large NN models (i.e. with a high number of layers) to solve complex problems.

Structure of a NN model

Basic Elements

Neuron

Takes multiple inputs, computes their weighted sum and passes the result as output.

Layer

Set of similar neurons taking different inputs and/or having different weights.

Neural Network (NN)

Sequence of layers.

Linear Functions

Linear Function

Function that can be written like this: \[ f(\alpha_1, \cdots, \alpha_n) = (\beta_{1,1} \alpha_1 + \cdots + \beta_{1,n} \alpha_n, \cdots, \beta_{m,1} \alpha_1 + \cdots + \beta_{m,n} \alpha_n) \]

Composition of Linear Functions

The composition of any number of linear functions is a linear function.
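
This property can be checked numerically. In the sketch below, each linear function is represented by its matrix of coefficients (the beta values of the definition). The matrices `A` and `B` and the helper names are arbitrary choices made for illustration.

```python
def matvec(matrix, vector):
    """Apply the linear function represented by `matrix` to `vector`."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Matrix product of a and b, which represents 'apply b first, then a'."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

A = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # linear function from R^2 to R^3
B = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0]]     # linear function from R^3 to R^2
x = [1.0, -2.0]

two_steps = matvec(B, matvec(A, x))  # apply A, then B
one_step = matvec(matmul(B, A), x)   # apply the single linear function B·A
print(two_steps, one_step)
```

Both computations give the same result, which is why stacking layers without anything in between them would not make a model more expressive than a single layer.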

Activation Functions - Definition

Activation Function

Function applied to the output of a NN layer (i.e. to the output of each of its neurons) to introduce non-linearity to the model.

Activation functions make it possible to approximate much more complex functions, using a sequence of alternating affine layers and activation layers.

Activation Functions - Examples

Rectified linear unit (ReLU)[2]

Hyperbolic tangent (tanh)[3]

Logistic, sigmoid, or soft step[4]

Leaky rectified linear unit (Leaky ReLU)[5]
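
The four functions above can be written in a few lines. This is a plain-Python sketch for scalar inputs; the default `negative_slope=0.01` of the Leaky ReLU is an assumed value, since the slope is a free parameter.

```python
import math

def relu(x):
    """Rectified linear unit: keeps positive values, replaces negative ones by 0."""
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative values are scaled down instead of being zeroed."""
    return x if x > 0 else negative_slope * x

def sigmoid(x):
    """Logistic function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes any real value into (-1, 1)."""
    return math.tanh(x)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  leaky_relu={leaky_relu(x):.3f}  "
          f"sigmoid={sigmoid(x):.3f}  tanh={tanh(x):.3f}")
```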

Affine layers

Fully Connected

The most basic layer, in which each output is a linear combination of all the inputs (before the activation layer).

Fully Connected Layer[6]
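
A minimal sketch of the computation done by such a layer is given below. The bias term is an assumption, added because the layer is presented as affine; the weights and inputs are arbitrary illustration values.

```python
def fully_connected(inputs, weights, biases):
    """Each output is a weighted sum of all the inputs, plus one bias per output."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

inputs = [1.0, 2.0, 3.0]                        # 3 input features
weights = [[0.5, -1.0, 0.0], [1.0, 1.0, 1.0]]   # one row of weights per output neuron
biases = [0.0, -1.0]

outputs = fully_connected(inputs, weights, biases)
activated = [max(0.0, value) for value in outputs]  # ReLU activation layer
print(outputs, activated)
```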

Convolutional

A layer combining spatially close features, used a lot to process raster data such as images.

2D Convolutional Layer[7]

Recurrent

Type of layers designed to process sequential data such as text, time series data, speech or audio. Works by combining input data and the state of the previous time step.

The two main variants of recurrent layers are:

Long Short-Term Memory (LSTM)[8]

Gated Recurrent Unit (GRU)[9]

However, transformer architectures are nowadays generally preferred to process sequential data.

Pooling

A type of layer used to reduce the number of features by merging multiple features into one. There are multiple kinds of pooling layers, the simplest ones being Maximum Pooling and Average Pooling.

Max Pooling Example[10]
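
Max pooling can be sketched as follows for a 2D grid. The sketch assumes non-overlapping windows and grid dimensions that are multiples of the window size; the function name and the input grid are made up for illustration.

```python
def max_pool_2d(grid, size=2):
    """Non-overlapping max pooling: keep the largest value of each size x size window."""
    pooled = []
    for i in range(0, len(grid), size):
        row = []
        for j in range(0, len(grid[0]), size):
            window = [grid[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled

grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2d(grid)
print(pooled)  # [[4, 2], [2, 8]]
```

Replacing `max(window)` by `sum(window) / len(window)` would give Average Pooling instead.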

Residual

A Residual Block aims at stabilizing training and convergence of deep neural networks (with a large number of layers), by adding the input of a given layer to the output of another layer further down in the architecture.

Residual Block skipping two layers[11]
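
The skip connection can be sketched in plain Python as below. The helper names are hypothetical, and applying a ReLU after the addition is one possible design among several, assumed here for illustration.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def linear(values, weights):
    """Linear layer without bias: one row of weights per output."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def residual_block(x, w1, w2):
    """Two layers with a skip connection: the input x is added to their output."""
    hidden = relu(linear(x, w1))
    transformed = linear(hidden, w2)
    return relu([t + xi for t, xi in zip(transformed, x)])

x = [1.0, 2.0]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block(x, zero_weights, zero_weights))  # [1.0, 2.0]
```

With all weights at zero, the block still passes its (non-negative) input through unchanged, which illustrates why adding such blocks to a deep network does not prevent information from reaching the later layers.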

Attention

Attention aims at determining relative importance of each part of the input to make better predictions. It is used a lot in natural language processing (NLP) and image processing.

Attention mechanism in seq2seq with RNN[12]
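
As one concrete instance, the sketch below implements scaled dot-product attention for a single query. This specific variant is an assumption chosen for illustration (it is the one used in transformers, and not necessarily the one shown in the figure); the vectors are arbitrary.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shifted for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by the similarity between its key and the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    output = [sum(w * value[d] for w, value in zip(weights, values))
              for d in range(len(values[0]))]
    return weights, output

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one key per part of the input
values = [[10.0], [20.0], [30.0]]              # one value per part of the input
weights, output = attend(query, keys, values)
print(weights, output)
```

The part of the input whose key is most similar to the query receives the largest weight, so it contributes the most to the output.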

And a lot more…

  • Dropout: randomly drop out some of the nodes during training to reduce overfitting
  • Batch Normalization: normalize the input of each layer across the batch to improve training stability and speed
  • Layer Normalization: normalize the input of each layer across the features to improve training stability and speed
  • Embedding: transforms discrete input data into continuous vectors in a lower-dimensional space
  • Flatten: convert multi-dimensional data into 1D data that can be fed into fully connected layers

Architectures of NN

Fully Connected Network (FCN)

Fully connected network[13]

Convolutional Neural Network (CNN)

Standard for image processing, with 2D convolutional layers followed by fully connected layers.

1D convolutional neural network[14]

A Lot More

  • Recurrent Neural Network (RNN)
  • Generative Adversarial Network (GAN)
  • Autoencoder Network
  • Transformer Network

Examples

Model

Example of fully connected network made with NN-SVG

PyTorch

import torch
import torch.nn as nn

# Simple fully connected model with 2 hidden layers
class SimpleMLP(nn.Module):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(2, 20)   # Input layer to 1st hidden layer
        self.fc2 = nn.Linear(20, 10)  # 1st hidden layer to 2nd hidden layer
        self.fc3 = nn.Linear(10, 1)   # 2nd hidden layer to output layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))   # ReLU activation after first layer
        x = torch.relu(self.fc2(x))   # ReLU activation after second layer
        x = self.fc3(x)               # Output layer (no activation for regression tasks)
        return x

Tensorflow + Keras

import tensorflow as tf
from tensorflow.keras import layers, Model

# Simple fully connected model with 2 hidden layers
class SimpleMLP(Model):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        self.fc1 = layers.Dense(20, activation='relu')  # Input layer to 1st hidden layer
        self.fc2 = layers.Dense(10, activation='relu')  # 1st hidden layer to 2nd hidden layer
        self.fc3 = layers.Dense(1)  # 2nd hidden layer to output layer (no activation for regression)

    def call(self, inputs):
        x = self.fc1(inputs)
        x = self.fc2(x)
        return self.fc3(x)

Optimization

Why Optimization?

The number of possible combinations of parameters is huge, even with small NN models. To (hopefully) find the best possible combination, we need two things:

  • A way to evaluate any combination
  • A way to find well-performing combinations

Loss Function

Definition

A loss function is a mathematical function that quantifies the difference between the network’s predicted output and the actual target values. The goal during training is to minimize this loss by adjusting the model’s weights, using gradient descent.

The most common loss functions are:

  • Mean Squared Error (MSE) for regression
  • Cross-entropy Loss for classification
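
Both losses can be sketched in a few lines. The cross-entropy below is written for a single sample and assumes that the model outputs a probability for each class; the numbers are arbitrary illustration values.

```python
import math

def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_class):
    """Cross-entropy for one sample: negative log of the probability given to the true class."""
    return -math.log(probabilities[true_class])

regression_loss = mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
loss_if_right = cross_entropy([0.7, 0.2, 0.1], true_class=0)  # true class got probability 0.7
loss_if_wrong = cross_entropy([0.7, 0.2, 0.1], true_class=2)  # true class got probability 0.1
print(regression_loss, loss_if_right, loss_if_wrong)
```

In both cases, the loss gets larger as the prediction gets further from the expected output.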

Differentiable

To be able to perform gradient descent, the loss function must be differentiable, which means continuous (no jump) and smooth (no sudden change of direction).

Examples of differentiable functions[15]

Examples of non-differentiable functions[15]

Convex

To get the best results when performing gradient descent, it is also better if the function is convex. The simplest definition of convexity is that if you draw a straight segment between any two points on the curve, the curve lies on or below that segment.

Example of convex and non-convex functions[15]

Gradient Descent

Definition

Gradient Descent

The process of iteratively computing the gradient of a function and moving in the opposite direction to (hopefully) reach the minimum value of the function (if it exists).

Gradient Descent works because at any point in the domain of the function, the gradient points in the direction of steepest ascent. So locally, following the opposite direction is the quickest way to reach lower values of the function. If we come back to the requirements listed before:

  • Differentiable functions are functions where the gradient exists everywhere
  • Convex functions are convenient for gradient descent because any local minimum is also a global minimum, so steadily going down the function leads to the minimum value (if it exists).

Algorithm

Gradient Descent boils down to iteratively:

  1. Compute the gradient of the loss function at the current point
  2. Take a step in the direction opposite to the gradient to reach a new point
  3. Repeat steps 1 and 2 until the stop condition is met

In this process, the three things that have to be defined are:

  • The starting point (weights initialization)
  • The size of the steps (learning rate)
  • The condition to stop
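
The whole loop can be sketched for a function of one variable. The example function f(x) = (x - 3)^2, the starting point, the learning rate and the stop condition (gradient close to zero, or a maximum number of steps) are all assumptions chosen for illustration.

```python
def gradient_descent(gradient, start, learning_rate, max_steps=1000, tolerance=1e-8):
    """Minimize a 1D function, given a function that computes its gradient."""
    point = start                              # starting point
    for _ in range(max_steps):                 # stop condition 1: step budget exhausted
        grad = gradient(point)                 # 1. compute the gradient at the current point
        if abs(grad) < tolerance:              # stop condition 2: gradient is almost zero
            break
        point = point - learning_rate * grad   # 2. step in the direction opposite to the gradient
    return point

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0, learning_rate=0.1)
print(minimum)
```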

Weights initialization

The starting point is defined by the initial values of the weights of the model, which determine its first output. There are numerous methods to initialize the weights, but a common one is to draw them randomly from a centered and normalized Gaussian distribution (zero mean, unit variance).

Learning rate

The gradient gives us a direction and a norm, but this norm has to be rescaled using what we call the learning rate. The learning rate is therefore not the size of the steps itself, but the scalar factor applied to the gradient: the size of a step is the learning rate multiplied by the norm of the gradient, so the norm still plays a crucial role.

The choice of the learning rate is crucial to hopefully converge quickly to the global minimum loss.

Example of gradient descent on the same function with different learning rates[15]
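
The effect of the learning rate can be reproduced on a toy problem. The sketch below minimizes f(x) = x^2 (gradient 2x, minimum at 0) for a fixed number of steps; the function and the three learning rates are assumptions chosen to show the three typical behaviours.

```python
def run(learning_rate, start=1.0, steps=20):
    """Run `steps` iterations of gradient descent on f(x) = x^2 and return the final point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # the gradient of x^2 is 2x
    return x

too_small = run(0.01)  # moves towards 0, but very slowly
adequate = run(0.4)    # gets very close to 0
too_large = run(1.1)   # overshoots more and more at each step and diverges
print(too_small, adequate, too_large)
```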

Stop condition

The stop condition determines when you decide to stop the algorithm. An easy solution is to choose a number of steps before launching the algorithm, but this implies either useless computations after the algorithm has reached a final point, or stopping too early and not getting the best possible results.

Therefore, although there are more complex methods, the most common and simple process is to monitor the value of the loss, memorize the lowest value ever reached, and stop when there has been a given number of steps without any improvement to the best value. Then, we usually keep the model weights corresponding to this best value.
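
This stop condition can be sketched as follows. The function name, the `patience` parameter name and the list of loss values are assumptions for illustration; in a real training loop, the loss would be computed at each step instead of being read from a list.

```python
def early_stopping(losses, patience):
    """Return (best_step, best_loss, stop_step) for a sequence of monitored losses."""
    best_loss = float("inf")
    best_step = -1
    for step, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_step = loss, step  # in practice, also save the weights here
        elif step - best_step >= patience:
            return best_step, best_loss, step  # no improvement for `patience` steps
    return best_step, best_loss, len(losses) - 1

losses = [1.0, 0.8, 0.6, 0.65, 0.61, 0.62, 0.7, 0.5]
result = early_stopping(losses, patience=3)
print(result)  # (2, 0.6, 5)
```

In this example, the algorithm stops at step 5 and keeps the weights of step 2, so it never reaches the lower loss of step 7: a patience that is too small can stop the training too early.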

Unlucky examples

Examples with a saddle point[15]

Animated example with a saddle point[15]

Backpropagation

Why?

Gradient Descent is beautiful, but right now, we only know in which direction (the gradient) the output of the model should go. To transmit this information to the weights of the layers of the model, we use backpropagation.

Definition

Backpropagation

The process of computing the gradient of the loss with respect to the weights of each layer of the model, so that they can be modified accordingly. The name comes from the process starting with the last layer of the model and propagating incrementally back to the first layer.

How?

Backpropagation involves a lot of computations of partial derivatives, which are individually not difficult but very tedious. Fortunately, NN libraries handle backpropagation automatically through a single function call, so there is no need to worry about it.
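
To give an idea of what the libraries compute, here is a sketch of backpropagation done by hand on a tiny model with two weights: hidden = ReLU(w1 · x), output = w2 · hidden, loss = (output - target)^2. The model and all values are assumptions chosen for illustration; the result is checked against a numerical approximation of the derivative.

```python
def forward(w1, w2, x, target):
    """Forward pass: returns the intermediate values and the loss."""
    pre = w1 * x
    hidden = max(0.0, pre)  # ReLU
    output = w2 * hidden
    loss = (output - target) ** 2
    return pre, hidden, output, loss

def backward(w1, w2, x, target):
    """Backward pass: chain rule applied from the loss back to the first weight."""
    pre, hidden, output, _ = forward(w1, w2, x, target)
    d_output = 2.0 * (output - target)            # d loss / d output
    d_w2 = d_output * hidden                      # last layer first...
    d_hidden = d_output * w2
    d_pre = d_hidden * (1.0 if pre > 0 else 0.0)  # derivative of ReLU
    d_w1 = d_pre * x                              # ...then the first layer
    return d_w1, d_w2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Numerical check of d loss / d w1 with a small finite difference
eps = 1e-6
numeric_w1 = (forward(w1 + eps, w2, x, target)[3]
              - forward(w1 - eps, w2, x, target)[3]) / (2 * eps)
print(grad_w1, grad_w2, numeric_w1)
```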

Examples

PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Generate some simple data
X = torch.randn(100, 2)  # 100 samples, 2 features
y = (2 * X[:, 0]**3 + 0.5 * X[:, 1]**2 + torch.randn(100) * 0.5).unsqueeze(1)

# Create a DataLoader for batching
batch_size = 32
dataloader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

# Instantiate the model
model = SimpleMLP()  # A simple model that is assumed to be defined before

# Define a loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error Loss
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Stochastic Gradient Descent

# Training loop
epochs = 1000
epoch_losses = []  # List to store average loss at each epoch

for epoch in range(epochs):
    epoch_loss = 0.0  # Accumulate loss for this epoch
    for batch_X, batch_y in dataloader:
        # Forward pass
        predictions = model(batch_X)  # Compute the model's predictions
        batch_loss = criterion(predictions, batch_y)  # Compare predictions to true values
        
        # Backward pass and optimization
        optimizer.zero_grad()  # Reset gradient
        batch_loss.backward()  # Backward pass
        optimizer.step()  # Update parameters

        # Accumulate weighted loss for the batch
        epoch_loss += batch_loss.item() * batch_X.size(0)
    
    epoch_loss /= len(dataloader.dataset)  # Normalize by dataset size to get average epoch loss
    epoch_losses.append(epoch_loss)  # Store the epoch's average loss

    # Print loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1:>4}/{epochs}, Loss: {epoch_loss:.4f}")

Tensorflow + Keras

import tensorflow as tf
from tensorflow.keras import layers, Model
import numpy as np

# Generate some simple data
X = np.random.randn(100, 2).astype(np.float32)  # 100 samples, 2 features
y = (2 * X[:, 0]**3 + 0.5 * X[:, 1]**2 + np.random.randn(100) * 0.5).astype(np.float32)
y = y.reshape(-1, 1)  # Reshape y to (100, 1)

# Create a Dataset for batching
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size)

# Instantiate the model
model = SimpleMLP()  # A simple model that is assumed to be defined before

# Define a loss function and optimizer
loss_fn = tf.keras.losses.MeanSquaredError()  # Mean Squared Error Loss
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # Stochastic Gradient Descent

# Training loop
epochs = 1000
epoch_losses = []  # List to store average loss at each epoch

for epoch in range(epochs):
    epoch_loss = 0.0  # Accumulate loss for this epoch
    for batch_X, batch_y in dataset:
        # Forward pass
        with tf.GradientTape() as tape:
            predictions = model(batch_X)  # Compute the model's predictions
            batch_loss = loss_fn(batch_y, predictions)  # Compare predictions to true values
        
        # Backward pass and optimization
        gradients = tape.gradient(batch_loss, model.trainable_variables)  # Backward pass
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))  # Update parameters

        # Accumulate weighted loss for the batch
        epoch_loss += batch_loss.numpy() * len(batch_X)
    
    epoch_loss /= len(X)  # Normalize by dataset size to get average epoch loss
    epoch_losses.append(epoch_loss)  # Store the epoch's average loss

    # Print loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1:>4}/{epochs}, Loss: {epoch_loss:.4f}")

Transfer Learning

Introduction

Principle

Start Closer to the Goal

The basic idea of Transfer Learning is to use a model that was pre-trained on a similar task. This initial model should have learnt basic knowledge that is common to its initial task and our new task, making it capable of learning the new task faster.

Reasons

  • Limited amount of data for the new task
  • New task is similar to old task
  • Training a model is very costly
  • Better final performance with less overfitting

Applications

Two major applications:

  • Computer Vision
  • Natural Language Processing

Categories

Fine-tuning

Idea

Take a pre-trained model as is, freeze some of the layers (usually the first layers) and continue training from where it stopped, using our new dataset.
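
The idea of freezing can be sketched on a toy model with one "backbone" weight and one "head" weight. Everything here is hypothetical and chosen for illustration: the model, the names, the data and the hyperparameters. In a real library, freezing is done by excluding the frozen parameters from the gradient updates.

```python
def features(x, backbone_weight):
    """Frozen part of the model: its weight comes from the pre-training."""
    return max(0.0, backbone_weight * x)

def fine_tune(data, backbone_weight, head_weight, learning_rate=0.01, epochs=50):
    """Train only the head on the new data; the backbone weight is never modified."""
    for _ in range(epochs):
        for x, target in data:
            feat = features(x, backbone_weight)
            error = head_weight * feat - target
            head_weight -= learning_rate * 2.0 * error * feat  # gradient of error^2 w.r.t. the head
    return backbone_weight, head_weight

new_task_data = [(1.0, 6.0), (2.0, 12.0), (3.0, 18.0)]  # new task: target = 6 * x
backbone, head = fine_tune(new_task_data, backbone_weight=2.0, head_weight=1.0)
print(backbone, head)
```

The backbone keeps its pre-trained value, while the head adapts to the new task using only three samples.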

Challenges

  • Find a good pre-trained model
  • Have a proper dataset (even if fine-tuning works with smaller datasets)
  • Freeze the right number of layers (more if the tasks are similar)
  • Train the model properly (useful to know how it was pre-trained)

Feature Extraction

Multitask Learning

Knowledge Distillation

Python Libraries

PyTorch

  • Website:
  • Flexibility, ease of debugging, beginner-friendly
  • Favored in academia and by researchers

TensorFlow

  • Website:
  • Scalable, production-ready and easy deployment
  • Favored in production and industry

Keras

  • Website:
  • User-friendly, high-level and powerful with TensorFlow integration
  • Favored for first hands-on experience for beginners and for prototyping

Resources

Miscellaneous

Playground

YOLOv8 Model

The architecture of YOLOv8[16]

References

Introduction to Deep Learning

[1] User:Wiso. “Neural network example.” Available at: https://commons.wikimedia.org/w/index.php?curid=5084582.
[2] Laughsinthestocks. “Rectified linear unit (ReLU).” Available at: https://commons.wikimedia.org/w/index.php?curid=44920600.
[3] Laughsinthestocks. “Hyperbolic tangent (tanh).” Available at: https://commons.wikimedia.org/w/index.php?curid=44920568.
[4] Laughsinthestocks. “Logistic, sigmoid, or soft step.” Available at: https://commons.wikimedia.org/w/index.php?curid=44920533.
[5] Laughsinthestocks. “Leaky rectified linear unit (leaky ReLU).” Available at: https://commons.wikimedia.org/w/index.php?curid=46839644.
[6] Diego Unzueta. “Fully connected layer.” Available at: https://builtin.com/machine-learning/fully-connected-layer.
[7] Diego Unzueta. “2D convolutional layer.” Available at: https://builtin.com/machine-learning/fully-connected-layer.
[8] fdeloche. “Long short-term memory (LSTM).” Available at: https://commons.wikimedia.org/w/index.php?curid=60149410.
[9] fdeloche. “Gated recurrent unit (GRU).” Available at: https://commons.wikimedia.org/w/index.php?curid=60466441.
[10] Daniel Voigt Godoy. “Max pooling example.” Available at: https://github.com/dvgodoy/dl-visuals/.
[11] LunarLullaby. “Residual block skipping two layers.” Available at: https://commons.wikimedia.org/w/index.php?curid=131458370.
[12] Google. “Attention mechanism in seq2seq with RNN.” Available at: https://github.com/google/seq2seq.
[13] Sijie Yang, Fei Zhu, Xinghong Ling, Quan Liu, Peiyao Zhao. “Intelligent health care: Applications of deep learning in computational medicine.” In: Frontiers in Genetics. Available at: https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.607471.
[14] Vicente Oyanedel M. “Own work.” Available at: https://commons.wikimedia.org/w/index.php?curid=95810663.
[15] Robert Kwiatkowski. “Gradient descent algorithm: A deep dive.” Available at: https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21.
[16] Ultralytics. “The architecture of YOLOv8.” Available at: https://github.com/ultralytics/ultralytics/issues/189.